We want to use the RStudio interface to control git. I created an RStudio Project, wrote some text, and I hope I'll add some nice statistical graphs :).
This course looks very interesting. But I'm really worried now because my laptop broke down last Friday. I hope I can recover my data.
*update: one week later, my work PC is running again and I recovered my data, but I do not trust it anymore :(
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Thu Nov 19 18:08:05 2020"
REPOSITORY:
“There is a pleasure in the pathless woods,
There is a rapture on the lonely shore,
There is society, where none intrudes,
By the deep Sea, and music in its roar:
I love not Man the less, but Nature more,
From these our interviews, in which I steal
From all I may be, or have been before,
To mingle with the Universe, and feel
What I can ne’er express, yet cannot all conceal.”
by Lord Byron
───▄▀▀▀▄▄▄▄▄▄▄▀▀▀▄───
───█▒▒░░░░░░░░░▒▒█───
────█░░█░░░░░█░░█────
─▄▄──█░░░▀█▀░░░█──▄▄─
█░░█─▀▄░░░░░░░▄▀─█░░█
First session: creating a data set.
data source: survey of Approaches to Learning…
Kimmo Vehkalahti: ASSIST 2014 - Phase 3 (end of Part 2), N=183. Course: Johdatus yhteiskuntatilastotieteeseen, syksy 2014 (Introduction to Social Statistics, fall 2014 - in Finnish). International survey of Approaches to Learning, made possible by Teachers' Academy funding for KV in 2013-2015.
Dataset structure: 7 variables and 166 observations
Several questions were asked in the survey; three learning approaches resulted from combining similar questions (the deep, surface and strategic approaches).
data <- read.csv("learning2014.csv")
str(data)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
Gender: 110 female (F) and 56 male (M)
Age: 17 to 55 years (mean 25)
Attitude: mean 3.2 (range 1.4-5)
Deep: mean 3.68 (range 1.583-4.917)
Strategic (stra): mean 3.121 (range 1.250-5)
Surface (surf): mean 2.787 (range 1.583-4.333)
Points: mean 22.72 (range 7-33)
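The per-variable summaries above can be reproduced with `table()`, `summary()` and `range()`. A minimal sketch on a hypothetical toy data frame standing in for the real survey data:

```r
# Hypothetical toy data frame standing in for learning2014.csv (not the real values)
df <- data.frame(
  gender = factor(c("F", "M", "F", "F", "M")),
  Age    = c(53, 55, 49, 38, 50),
  Points = c(25, 12, 24, 21, 31)
)
table(df$gender)   # counts per gender level
summary(df$Age)    # min, quartiles, mean, max
range(df$Points)   # min and max points
```

With the real data, `summary(data)` gives all of these at once.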
library(GGally)
## Warning: package 'GGally' was built under R version 3.6.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.6.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
ggpairs(data, mapping = aes(colour=gender), legend = 1,
title="Fig.1 Variables overview",
lower = list(combo = wrap("facethist",size=0.1, bins = 20, alpha=0.3)))+
theme(legend.position="bottom")
The participants' age, the deep learning approach and the points obtained in the final test are not normally distributed (Fig.1).
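This visual impression could be checked formally, e.g. with a Shapiro-Wilk test. A hedged sketch on simulated data (with the real data one would test e.g. `data$Age`, `data$deep` or `data$Points`):

```r
set.seed(1)
x_skewed <- rexp(166)   # stand-in for a right-skewed variable like Age
x_normal <- rnorm(166)  # stand-in for an approximately normal variable
shapiro.test(x_skewed)$p.value  # small p-value: normality rejected
shapiro.test(x_normal)$p.value  # larger p-value: no strong evidence against normality
```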
LINEAR REGRESSION MODEL
mod1 <- lm(Points ~ Age + attitude + deep + surf + stra + gender, data = data)
summary(mod1)
##
## Call:
## lm(formula = Points ~ Age + attitude + deep + surf + stra + gender,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4506 -3.2479 0.3879 3.5243 11.4820
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.30246 5.25723 3.481 0.000644 ***
## Age -0.09607 0.05396 -1.780 0.076929 .
## attitude 3.44300 0.59776 5.760 4.24e-08 ***
## deep -1.04279 0.78438 -1.329 0.185606
## surf -1.10891 0.84467 -1.313 0.191132
## stra 0.95871 0.55150 1.738 0.084083 .
## genderM -0.05858 0.92985 -0.063 0.949845
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.265 on 159 degrees of freedom
## Multiple R-squared: 0.2312, Adjusted R-squared: 0.2021
## F-statistic: 7.967 on 6 and 159 DF, p-value: 1.599e-07
Attitude, age and strategy are the most relevant variables.
mod2 <- lm(Points ~ Age + attitude + stra, data = data)
summary(mod2)
##
## Call:
## lm(formula = Points ~ Age + attitude + stra, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1149 -3.2003 0.3303 3.4129 10.7599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89543 2.64834 4.114 6.17e-05 ***
## Age -0.08822 0.05302 -1.664 0.0981 .
## attitude 3.48077 0.56220 6.191 4.72e-09 ***
## stra 1.00371 0.53434 1.878 0.0621 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared: 0.2182, Adjusted R-squared: 0.2037
## F-statistic: 15.07 on 3 and 162 DF, p-value: 1.07e-08
The equation obtained is:
Points = 10.90 + 3.48*attitude + 1.00*stra - 0.09*Age (residual standard error 5.26)
The model explains about 20% of the variance in the points obtained in the final evaluation (adjusted R² = 0.204). Attitude is the most important variable for getting more points. The value of this model lies in explaining and detecting the most significant variables in the learning process, rather than in predicting the points.
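The figures quoted above (coefficients, residual standard error, adjusted R²) can all be pulled directly from a fitted `lm` object. A minimal sketch on simulated data, since `learning2014.csv` is not reproduced here; with the real data the same calls would be applied to `mod2`:

```r
set.seed(42)
n <- 166
attitude <- runif(n, 1, 5)               # simulated attitude scores
age      <- sample(17:55, n, replace = TRUE)
points   <- 10 + 3.5 * attitude - 0.1 * age + rnorm(n, sd = 5)
fit <- lm(points ~ attitude + age)
coef(fit)                    # intercept and slopes: the terms of the equation
sigma(fit)                   # residual standard error (the "5.26" above)
summary(fit)$adj.r.squared   # adjusted R^2 (the ~20% explained variance)
```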
#Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage
par(mfrow=c(2,2))
plot(mod2)
The graphs are not fancy but they are important: they show that the model assumptions hold reasonably well, so the accuracy reported for the model is trustworthy.
This data set concerns student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, and the data were collected using school reports and questionnaires. Two datasets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). More information: https://archive.ics.uci.edu/ml/datasets/Student+Performance
The joined data set used in the analysis exercise combines the two student alcohol consumption data sets. The following adjustments have been made:
The variables not used for joining the two data sets have been combined by averaging (including the grade variables). 'alc_use' is the average of 'Dalc' (daily consumption) and 'Walc' (weekend consumption). 'high_use' is TRUE if 'alc_use' is higher than 2 and FALSE otherwise.
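The two derived variables can be recomputed directly from `Dalc` and `Walc`. A sketch on a toy data frame (hypothetical values, not the real survey data):

```r
# Toy stand-in for the joined student data
alc_toy <- data.frame(Dalc = c(1, 2, 1, 3), Walc = c(1, 4, 1, 5))
alc_toy$alc_use  <- (alc_toy$Dalc + alc_toy$Walc) / 2  # average of daily and weekend use
alc_toy$high_use <- alc_toy$alc_use > 2                # TRUE above the cut-off of 2
alc_toy
```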
library(openxlsx)
## Warning: package 'openxlsx' was built under R version 3.6.3
pormath <- read.xlsx("pormath.xlsx")
str(pormath)
## 'data.frame': 370 obs. of 51 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : num 15 15 15 15 15 15 15 15 15 15 ...
## $ address : chr "R" "R" "R" "R" ...
## $ famsize : chr "GT3" "GT3" "GT3" "GT3" ...
## $ Pstatus : chr "T" "T" "T" "T" ...
## $ Medu : num 1 1 2 2 3 3 3 2 3 3 ...
## $ Fedu : num 1 1 2 4 3 4 4 2 1 3 ...
## $ Mjob : chr "at_home" "other" "at_home" "services" ...
## $ Fjob : chr "other" "other" "other" "health" ...
## $ reason : chr "home" "reputation" "reputation" "course" ...
## $ guardian : chr "mother" "mother" "mother" "mother" ...
## $ traveltime: num 2 1 1 1 2 1 2 2 2 1 ...
## $ studytime : num 4 2 1 3 3 3 3 2 4 4 ...
## $ schoolsup : chr "yes" "yes" "yes" "yes" ...
## $ famsup : chr "yes" "yes" "yes" "yes" ...
## $ activities: chr "yes" "no" "yes" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "yes" "yes" "no" "yes" ...
## $ romantic : chr "no" "yes" "no" "no" ...
## $ famrel : num 3 3 4 4 4 4 4 4 4 4 ...
## $ freetime : num 1 3 3 3 2 3 2 1 4 3 ...
## $ goout : num 2 4 1 2 1 2 2 3 2 3 ...
## $ Dalc : num 1 2 1 1 2 1 2 1 2 1 ...
## $ Walc : num 1 4 1 1 3 1 2 3 3 1 ...
## $ health : num 1 5 2 5 3 5 5 4 3 4 ...
## $ n : num 2 2 2 2 2 2 2 2 2 2 ...
## $ id.p : num 1096 1073 1040 1025 1166 ...
## $ id.m : num 2096 2073 2040 2025 2153 ...
## $ failures : num 0 1 0 0 1 0 1 0 0 0 ...
## $ paid : chr "yes" "no" "no" "no" ...
## $ absences : num 3 2 8 2 5 2 0 1 9 10 ...
## $ G1 : num 10 10 14 10 12 12 11 10 16 10 ...
## $ G2 : num 12 8 13 10 12 12 6 10 16 10 ...
## $ G3 : num 12 8 12 9 12 12 6 10 16 10 ...
## $ failures.p: num 0 0 0 0 0 0 0 0 0 0 ...
## $ paid.p : chr "yes" "no" "no" "no" ...
## $ absences.p: num 4 2 8 2 2 2 0 0 6 10 ...
## $ G1.p : num 13 13 14 10 13 11 10 11 15 10 ...
## $ G2.p : num 13 11 13 11 13 12 11 10 15 10 ...
## $ G3.p : num 13 11 12 10 13 12 12 11 15 10 ...
## $ failures.m: num 1 2 0 0 2 0 2 0 0 0 ...
## $ paid.m : chr "yes" "no" "yes" "yes" ...
## $ absences.m: num 2 2 8 2 8 2 0 2 12 10 ...
## $ G1.m : num 7 8 14 10 10 12 12 8 16 10 ...
## $ G2.m : num 10 6 13 9 10 12 0 9 16 11 ...
## $ G3.m : num 10 5 13 8 10 11 0 8 16 11 ...
## $ alc_use : num 1 3 1 1 2.5 1 2 2 2.5 1 ...
## $ high_use : logi FALSE TRUE FALSE FALSE TRUE FALSE ...
## $ cid : num 3001 3002 3003 3004 3005 ...
#classmate's script:
library(tidyr); library(dplyr); library(ggplot2); library(gridExtra); library(readr)
## Warning: package 'tidyr' was built under R version 3.6.3
## Warning: package 'dplyr' was built under R version 3.6.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Warning: package 'gridExtra' was built under R version 3.6.3
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
## Warning: package 'readr' was built under R version 3.6.3
alc <- read_csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt")
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_character(),
## age = col_double(),
## Medu = col_double(),
## Fedu = col_double(),
## traveltime = col_double(),
## studytime = col_double(),
## failures = col_double(),
## famrel = col_double(),
## freetime = col_double(),
## goout = col_double(),
## Dalc = col_double(),
## Walc = col_double(),
## health = col_double(),
## absences = col_double(),
## G1 = col_double(),
## G2 = col_double(),
## G3 = col_double(),
## alc_use = col_double(),
## high_use = col_logical()
## )
## i Use `spec()` for the full column specifications.
data <- mutate(alc, ID = row_number()) %>%
select(any_of(c("ID", "high_use", "absences", "health", "freetime", "famrel")))
p1 <- ggplot(data, aes(absences)) + stat_count(geom="line", aes(colour=high_use) )
p2 <- ggplot(data, aes(health)) + geom_bar(aes(fill=high_use), position="dodge")
p3 <- ggplot(data, aes(freetime)) + geom_bar(aes(fill=high_use), position="dodge")
p4 <- ggplot(data, aes(famrel)) + geom_bar(aes(fill=high_use), position="dodge")
grid.arrange(p1, p2, p3, p4, nrow=2, ncol=2)
date()
## [1] "Thu Nov 19 18:08:22 2020"
The data set this week is about housing values in the suburbs of Boston (source: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html). There are several variables that affect the housing values, such as criminality, infrastructure, density and the size of the house…
crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox: nitrogen oxides concentration (parts per 10 million).
rm: average number of rooms per dwelling.
age: proportion of owner-occupied units built prior to 1940.
dis: weighted mean of distances to five Boston employment centres.
rad: index of accessibility to radial highways.
tax: full-value property-tax rate per $10,000.
ptratio: pupil-teacher ratio by town.
black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat: lower status of the population (percent).
medv: median value of owner-occupied homes in $1000s.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.6.3
## corrplot 0.84 loaded
library(ggplot2)
library(GGally)
library (plotly)
## Warning: package 'plotly' was built under R version 3.6.3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
data("Boston")
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
cor <- cor(Boston, method = "pearson", use = "complete.obs")
round(cor, 2)
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
corrplot.mixed(cor, lower = "ellipse", upper="number", tl.col = "black", tl.srt = 45)
Bigger houses are more expensive (number of rooms, r = 0.70), and the house value is lower in areas with a higher proportion of lower-status population (r = -0.74). There are intercorrelations between the variables that can help explain the house values.
Here we subtract the column means from the corresponding columns and divide the difference by the standard deviation, so all the variables have mean 0 and standard deviation 1.
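What `scale()` does can be verified by hand: centre by the column mean, then divide by the column standard deviation. A quick check on a small vector:

```r
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)   # centre, then divide by the SD
auto   <- as.numeric(scale(x))    # what scale() does column-wise
all.equal(manual, auto)           # the two agree
```

The scaled values have mean 0 and standard deviation 1, as stated above.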
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)
Creating a quantile vector of crime:
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
Dividing the dataset into train and test sets, so that 80% of the data belongs to the train set.
n <- nrow(boston_scaled)              # number of observations
ind <- sample(n, size = n * 0.8)      # random 80% sample of row indices
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
Fitting the linear discriminant analysis (LDA) and drawing the LDA (bi)plot.
LDA is a classification (and dimension reduction) method. It finds the linear combinations of the variables that best separate the target variable classes. The target can be a binary or a multiclass variable.
lda.fit <- lda(crime ~ ., data = train)
lda.arrows <- function(x, myscale = 3, arrow_heads = 0.1, color = "orange", tex = 1.25, choices = c(1, 2)) {
  # function for the LDA biplot arrows (from DataCamp)
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 3)
The plot shows that there is more crime in areas with close access to the highways and in the residential zones, as shown in the correlation matrix.
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test) #Test=the rest=20%
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 14 10 0 0
## med_low 1 16 4 0
## med_high 2 9 13 0
## high 0 0 0 33
As in the LDA plot, the high crime class is predicted perfectly, while the low and middle classes are partly confused with each other.
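The cross-tabulation above can be summarised as an overall classification accuracy: the proportion of observations on the diagonal. A sketch using the counts printed above:

```r
# Confusion matrix copied from the printout above (rows = correct, cols = predicted)
tab <- matrix(c(14, 10,  0,  0,
                 1, 16,  4,  0,
                 2,  9, 13,  0,
                 0,  0,  0, 33),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))
accuracy <- sum(diag(tab)) / sum(tab)  # proportion classified correctly
accuracy
```

Roughly 75% of the test observations are classified correctly, with the high class predicted without error.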
The optimal number of clusters is 2 (the elbow of the total within-cluster sum of squares curve):
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
dis<-dist(boston_scaled)
summary(dis)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
set.seed(123)
k_max <- 20
# calculate the total within-cluster sum of squares for k = 1..20, on the scaled data
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
Visualization of the clusters per variable:
km <- kmeans(boston_scaled, centers = 2)
ggpairs(boston_scaled, mapping = aes(colour=as.factor(km$cluster)), legend = 1,
upper = list(continuous =wrap("cor", size=3)),
title="clusters overview",
lower = list(combo = wrap("facethist",size=0.1, bins = 20, alpha=0.3)))+
theme(legend.position="bottom")
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
km2 <- kmeans(boston_scaled, centers = 3)
boston_scaled$km_cluster <- km2$cluster
lda.fit2 <- lda(km_cluster ~ ., data = boston_scaled) # linear discriminant analysis
lda.fit2
## Call:
## lda(km_cluster ~ ., data = boston_scaled)
##
## Prior probabilities of groups:
## 1 2 3
## 0.3241107 0.4664032 0.2094862
##
## Group means:
## crim zn indus chas nox rm
## 1 0.8046456 -0.4872402 1.117990 0.01575144 1.1253988 -0.46443119
## 2 -0.3760908 -0.3417123 -0.296848 0.01127561 -0.3345884 -0.09228038
## 3 -0.4075892 1.5146367 -1.068814 -0.04947434 -0.9962503 0.92400834
## age dis rad tax ptratio black
## 1 0.79737580 -0.85425848 1.2219249 1.2954050 0.60580719 -0.6407268
## 2 -0.02966623 0.05695857 -0.5803944 -0.6030198 -0.08691245 0.2863040
## 3 -1.16762641 1.19486951 -0.5983266 -0.6616391 -0.74378342 0.3538816
## lstat medv
## 1 0.8719904 -0.68418954
## 2 -0.1801190 0.03577844
## 3 -0.9480974 0.97889973
##
## Coefficients of linear discriminants:
## LD1 LD2
## crim 0.03134296 0.14880455
## zn 0.06381527 1.22350515
## indus -0.61086696 0.10402980
## chas -0.01953161 -0.03579238
## nox -1.00230143 0.70464917
## rm 0.16285767 0.44390394
## age 0.07220634 -0.59785382
## dis 0.04270475 0.45498614
## rad -0.71987743 0.02882054
## tax -0.98285440 0.70663319
## ptratio -0.22527977 0.15514668
## black 0.01693595 -0.03181845
## lstat -0.18274033 0.50122677
## medv 0.02892966 0.64244841
##
## Proportion of trace:
## LD1 LD2
## 0.8409 0.1591
LDA biplot:
classes <- as.numeric(boston_scaled$km_cluster)# target classes as numeric
plot(lda.fit2, dimen = 2, col = classes, pch = classes)# plot the lda results
lda.arrows(lda.fit2, myscale = 3) # arrows for the new fit (lda.fit2)
There are differences between the clusters: cluster 3 is better explained by the rad and tax variables, while cluster 1 is better explained by age.
Creating K-means for the training data, coloured by crime incidence:
model_predictors <- dplyr::select(train, -crime)
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling # matrix multiplication
matrix_product <- as.data.frame(matrix_product)# matrix multiplication
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
color=train$crime, type= 'scatter3d', mode='markers')
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
Creation of 2 clusters:
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
train <- boston_scaled[ind,]
str(train)
## 'data.frame': 404 obs. of 14 variables:
## $ crim : num -0.342 -0.416 0.0708 -0.4003 -0.4136 ...
## $ zn : num -0.4872 2.9429 -0.4872 0.0487 2.5142 ...
## $ indus : num -0.437 -0.902 1.015 -0.476 -1.297 ...
## $ chas : num -0.272 -0.272 3.665 -0.272 -0.272 ...
## $ nox : num -0.144 -1.24 1.858 -0.265 -1.335 ...
## $ rm : num -0.671 0.82 -0.685 -0.399 1.076 ...
## $ age : num 0.772 -1.445 0.726 0.615 -2.081 ...
## $ dis : num 0.421 0.628 -0.898 1.328 1.915 ...
## $ rad : num -0.637 -0.637 1.66 -0.522 -0.522 ...
## $ tax : num -0.601 -0.969 1.529 -0.577 -0.298 ...
## $ ptratio: num 1.175 0.344 0.806 -1.504 -1.689 ...
## $ black : num 0.2213 0.4406 -0.0398 0.329 0.1633 ...
## $ lstat : num 0.302 -1.306 0.278 0.623 -1.108 ...
## $ medv : num -0.645 0.649 -0.623 -0.395 0.703 ...
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
km3 <- kmeans(train, centers = 2)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
color=as.factor(km3$cluster), type= 'scatter3d', mode='markers')
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
Creation of 3 clusters:
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
train <- boston_scaled[ind,]
km4 <- kmeans(train, centers = 3)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
color=as.factor(km4$cluster), type= 'scatter3d', mode='markers')
date()
## [1] "Thu Nov 19 18:09:12 2020"